
%pip install tensorflow
%pip install lets-plot

Exploration

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from lets_plot import *
# Model testing first
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from sklearn.metrics import root_mean_squared_error, r2_score
import numpy as np
bikes = pd.read_csv("https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/bikes.csv")
bikes.tail()
dteday hr casual registered temp_c feels_like_c hum windspeed weathersit season holiday workingday
112470 10/31/2023 19.0 241 594 10.2 10.2 0.4516 8.4 2 4 0 1
112471 10/31/2023 20.0 171 450 8.9 7.5 0.5268 9.3 1 4 0 1
112472 10/31/2023 21.0 176 402 7.7 6.4 0.5756 7.8 1 4 0 1
112473 10/31/2023 22.0 106 257 8.0 8.0 0.5604 0.0 1 4 0 1
112474 10/31/2023 23.0 78 135 6.5 6.5 0.6401 4.5 1 4 0 1

It is important to be aware of every kind of event (including political ones), because those days saw the maximum bike usage.
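As a quick sketch of how those maximum-usage days can be surfaced, the daily totals can be ranked and the top entries cross-referenced against an event calendar. The rows below are made-up stand-ins for the real bikes DataFrame:

```python
import pandas as pd

# Hypothetical sample rows standing in for the real bikes data
toy = pd.DataFrame({
    'dteday': ['3/25/2017', '3/25/2017', '7/4/2022', '7/4/2022', '1/2/2016'],
    'casual': [1034, 1030, 1029, 1083, 5],
    'registered': [582, 595, 418, 445, 40],
})
toy['total'] = toy['casual'] + toy['registered']

# Rank days by total usage; the top entries are candidates for event lookup
top_days = toy.groupby('dteday')['total'].sum().nlargest(2)
print(top_days)
```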

bikes.describe()
hr casual registered temp_c feels_like_c hum windspeed weathersit season holiday workingday
count 112475.000000 112475.000000 112475.000000 112475.000000 112475.000000 112475.000000 112475.000000 112475.000000 112475.000000 112475.000000 112475.000000
mean 11.501098 90.434612 249.193625 15.376487 14.659325 0.636624 13.100614 1.405441 2.495799 0.030300 0.684312
std 6.921864 128.655621 258.267544 9.749467 11.428324 0.190328 7.857600 0.683450 1.101152 0.171412 0.464791
min 0.000000 0.000000 0.000000 -14.700000 -24.000000 0.088900 0.000000 1.000000 1.000000 0.000000 0.000000
25% 6.000000 7.000000 48.000000 7.600000 5.400000 0.484100 7.700000 1.000000 2.000000 0.000000 0.000000
50% 12.000000 36.000000 180.000000 16.000000 16.000000 0.640900 12.200000 1.000000 2.000000 0.000000 1.000000
75% 18.000000 122.000000 360.000000 23.500000 23.500000 0.798800 17.500000 2.000000 3.000000 0.000000 1.000000
max 23.000000 1244.000000 1702.000000 40.500000 49.600000 1.000000 69.800000 4.000000 4.000000 1.000000 1.000000
bikes[bikes['casual']>1000]
dteday hr casual registered temp_c feels_like_c hum windspeed weathersit season holiday workingday
54606 3/25/2017 13.0 1034 582 20.8 20.8 0.5386 16.2 2 2 0 0
54608 3/25/2017 15.0 1030 595 25.0 25.0 0.4324 9.2 1 2 0 0
55112 4/15/2017 15.0 1037 558 22.9 22.9 0.5674 12.8 1 2 0 0
63847 4/14/2018 15.0 1050 746 28.8 28.2 0.3711 30.0 1 2 0 0
82661 6/6/2020 15.0 1005 409 31.6 34.2 0.5215 16.3 1 2 0 0
82662 6/6/2020 16.0 1244 501 32.0 34.2 0.4895 20.7 1 2 0 0
82663 6/6/2020 17.0 1061 416 32.1 33.5 0.4559 26.7 1 2 0 0
86357 11/7/2020 15.0 1008 550 22.9 22.9 0.3632 10.9 1 4 0 0
89715 3/27/2021 14.0 1008 457 20.8 20.8 0.4441 14.3 1 2 0 0
100854 7/4/2022 18.0 1029 418 30.5 29.6 0.3409 14.3 1 3 1 0
100855 7/4/2022 19.0 1083 445 29.2 28.6 0.3702 12.8 1 3 1 0
100856 7/4/2022 20.0 1048 424 28.0 27.8 0.4100 10.4 1 3 1 0
100857 7/4/2022 21.0 1042 487 26.0 26.0 0.4670 13.8 1 3 1 0
107208 3/26/2023 13.0 1123 651 17.0 17.0 0.2744 21.8 1 2 0 0
107209 3/26/2023 14.0 1141 639 18.6 18.6 0.2493 21.2 1 2 0 0
107210 3/26/2023 15.0 1206 660 19.2 19.2 0.2231 15.8 1 2 0 0
107211 3/26/2023 16.0 1038 624 20.5 20.5 0.2157 12.8 1 2 0 0
107212 3/26/2023 17.0 1003 593 20.6 20.6 0.2061 12.4 1 2 0 0
bikes["datetime"] = pd.to_datetime(bikes["dteday"])
bikes["month"] = bikes["datetime"].dt.month
bikes['day'] = bikes['datetime'].dt.day
bikes["day_of_week"] = bikes["datetime"].dt.dayofweek
bikes["year"] = bikes["datetime"].dt.year
bikes = bikes.drop(columns = ['dteday'])
bikes['total'] = bikes['casual'] + bikes['registered']
bikes[(bikes['registered']>1500) | (bikes['casual'] > 1500)].groupby(['day','month','year']).agg(total_casual = ('casual','sum'), total_registered = ('registered','sum'))
total_casual total_registered
day month year
4 4 2017 381 1607
2019 787 6154
5 5 2017 582 3064
2018 568 3115
6 6 2016 1285 9210
2017 640 3082
2018 1704 7619
2019 375 3105
7 7 2018 338 1527
8 8 2017 1076 4547
2018 930 4686
2019 99 1517
9 9 2017 2645 15475
2018 712 6151
2019 502 9210
10 10 2016 537 3056
2017 1443 9196
2018 1316 12458
2019 598 9505
2023 2053 4569
bikes[(bikes['casual'] > 1500) | (bikes['registered'] > 1500)]
hr casual registered temp_c feels_like_c hum windspeed weathersit season holiday workingday datetime month day day_of_week year total
47627 17.0 198 1514 26.6 26.6 0.4366 18.6 1 2 0 1 2016-06-07 6 6 1 2016 1712
47651 17.0 194 1547 22.2 22.2 0.2903 37.4 1 2 0 1 2016-06-08 6 6 2 2016 1741
47795 17.0 250 1588 26.7 26.7 0.3243 7.4 1 2 0 1 2016-06-14 6 6 1 2016 1838
47987 17.0 245 1513 27.8 27.4 0.3788 16.5 1 3 0 1 2016-06-22 6 6 2 2016 1758
48107 17.0 212 1534 30.0 32.6 0.5910 20.4 1 3 0 1 2016-06-27 6 6 0 2016 1746
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
77216 17.0 93 1555 19.3 19.3 0.3202 15.0 1 4 0 1 2019-10-23 10 10 2 2019 1648
77336 17.0 92 1554 19.3 19.3 0.5487 11.2 1 4 0 1 2019-10-28 10 10 0 2019 1646
111796 17.0 678 1507 28.5 28.3 0.4210 1.1 1 4 0 1 2023-10-03 10 10 1 2023 2185
111820 17.0 647 1528 27.0 27.1 0.4461 11.0 1 4 0 1 2023-10-04 10 10 2 2023 2175
112324 17.0 728 1534 24.9 24.9 0.3097 7.8 1 4 0 1 2023-10-25 10 10 2 2023 2262

77 rows × 17 columns

bikes[(bikes['year'] == 2018) & (bikes['season'] == 3)].head(20)
hr casual registered temp_c feels_like_c hum windspeed weathersit season holiday workingday datetime month day day_of_week year total
65464 0.0 13 36 23.3 23.3 0.8523 0.0 2 3 0 1 2018-06-21 6 6 3 2018 49
65465 1.0 7 23 23.3 23.3 0.8531 11.3 2 3 0 1 2018-06-21 6 6 3 2018 30
65466 2.0 2 9 22.8 22.8 0.8774 9.5 2 3 0 1 2018-06-21 6 6 3 2018 11
65467 3.0 0 5 22.8 22.8 0.8780 5.4 2 3 0 1 2018-06-21 6 6 3 2018 5
65468 4.0 2 21 22.8 22.8 0.8790 0.0 1 3 0 1 2018-06-21 6 6 3 2018 23
65469 5.0 8 106 22.7 22.7 0.8798 7.5 1 3 0 1 2018-06-21 6 6 3 2018 114
65470 6.0 17 395 22.8 22.8 0.8801 7.5 1 3 0 1 2018-06-21 6 6 3 2018 412
65471 7.0 40 881 23.3 23.3 0.8800 5.4 1 3 0 1 2018-06-21 6 6 3 2018 921
65472 8.0 95 1241 24.3 24.3 0.8268 7.2 1 3 0 1 2018-06-21 6 6 3 2018 1336
65473 9.0 88 571 26.1 26.1 0.7442 7.5 1 3 0 1 2018-06-21 6 6 3 2018 659
65474 10.0 113 261 26.1 26.1 0.7644 7.7 1 3 0 1 2018-06-21 6 6 3 2018 374
65475 11.0 121 297 27.1 29.1 0.6977 12.7 2 3 0 1 2018-06-21 6 6 3 2018 418
65476 12.0 131 328 27.1 29.1 0.6986 18.0 1 3 0 1 2018-06-21 6 6 3 2018 459
65477 13.0 171 403 27.1 29.5 0.7450 18.2 1 3 0 1 2018-06-21 6 6 3 2018 574
65478 14.0 196 364 27.8 29.7 0.6527 18.2 1 3 0 1 2018-06-21 6 6 3 2018 560
65479 15.0 225 441 28.2 30.3 0.6350 16.0 2 3 0 1 2018-06-21 6 6 3 2018 666
65480 16.0 259 773 29.2 31.4 0.6003 18.1 1 3 0 1 2018-06-21 6 6 3 2018 1032
65481 17.0 354 1341 28.3 30.0 0.6098 14.9 2 3 0 1 2018-06-21 6 6 3 2018 1695
65482 18.0 289 1188 27.8 29.4 0.6276 16.3 2 3 0 1 2018-06-21 6 6 3 2018 1477
65483 19.0 218 782 27.1 28.6 0.6523 11.0 2 3 0 1 2018-06-21 6 6 3 2018 1000

What is the peak hour?

Between 6 AM and 8 AM, and between 4 PM and 6 PM. On weekends the peak is not as pronounced; instead, people ride fairly steadily between 8 AM and 8 PM.


What % of total daily rentals happens within these windows?

Approximately 69% of daily rentals happen within the 6-10 AM and 2-8 PM windows.

On weekdays, the morning window accounts for about 23% and the afternoon window for about 46%.

On weekends, rentals are distributed more evenly throughout the day: no sharp peaks, just consistent usage between 6 AM and 8 PM, with more density between 12 PM and 3 PM.

import pandas as pd
import numpy as np

# Define peak hour windows
# Morning window: 6-10 AM (hours 6-9)
peak_morning_start = 6
peak_morning_end = 10  # exclusive, so includes hours 6 through 9

# Afternoon window: 2-8 PM (hours 14-19)
peak_afternoon_start = 14
peak_afternoon_end = 20  # exclusive, so includes hours 14 through 19

# Calculate total daily rentals
daily_rentals = bikes.groupby(['year', 'month', 'day', 'day_of_week'])['total'].sum().reset_index()
daily_rentals = daily_rentals.rename(columns={'total': 'daily_total_rentals'})

# Calculate rentals during peak morning hours
peak_morning_rentals = bikes[
    (bikes['hr'] >= peak_morning_start) & (bikes['hr'] < peak_morning_end)
].groupby(['year', 'month', 'day', 'day_of_week'])['total'].sum().reset_index()
peak_morning_rentals = peak_morning_rentals.rename(columns={'total': 'peak_morning_rentals'})

# Calculate rentals during peak afternoon hours
peak_afternoon_rentals = bikes[
    (bikes['hr'] >= peak_afternoon_start) & (bikes['hr'] < peak_afternoon_end)
].groupby(['year', 'month', 'day', 'day_of_week'])['total'].sum().reset_index()
peak_afternoon_rentals = peak_afternoon_rentals.rename(columns={'total': 'peak_afternoon_rentals'})

# Merge all dataframes
merged_rentals = pd.merge(daily_rentals, peak_morning_rentals, on=['year', 'month', 'day', 'day_of_week'], how='left')
merged_rentals = pd.merge(merged_rentals, peak_afternoon_rentals, on=['year', 'month', 'day', 'day_of_week'], how='left')

# Fill NaN values (days without rentals in a specific peak window) with 0
merged_rentals['peak_morning_rentals'] = merged_rentals['peak_morning_rentals'].fillna(0)
merged_rentals['peak_afternoon_rentals'] = merged_rentals['peak_afternoon_rentals'].fillna(0)

# Calculate total peak hour rentals
merged_rentals['total_peak_rentals'] = merged_rentals['peak_morning_rentals'] + merged_rentals['peak_afternoon_rentals']

# Calculate percentage of total daily rentals during peak hours
# Handle potential division by zero if daily_total_rentals is 0
merged_rentals['peak_percentage'] = (
    merged_rentals['total_peak_rentals'] / merged_rentals['daily_total_rentals']
) * 100
merged_rentals['peak_percentage'] = merged_rentals['peak_percentage'].replace([np.inf, -np.inf], 0).fillna(0)


# Calculate the average peak percentage by day of the week
peak_hour_percentages = merged_rentals.groupby('day_of_week')['peak_percentage'].mean().reset_index()

# Map day_of_week to names for better readability
day_names = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
peak_hour_percentages['day_of_week_name'] = peak_hour_percentages['day_of_week'].map(day_names)

print("Average percentage of daily total bike rentals during peak hours (6-10 AM and 2-8 PM) by day of the week:")
display(peak_hour_percentages)
Average percentage of daily total bike rentals during peak hours (6-10 AM and 2-8 PM) by day of the week:
day_of_week peak_percentage day_of_week_name
0 0 69.003372 Monday
1 1 70.681812 Tuesday
2 2 70.072425 Wednesday
3 3 69.145699 Thursday
4 4 66.649259 Friday
5 5 53.598131 Saturday
6 6 53.498964 Sunday
((merged_rentals[merged_rentals['day_of_week']<5]['peak_morning_rentals'] / merged_rentals[merged_rentals['day_of_week']<5]['daily_total_rentals'])*100).mean()
np.float64(23.237825497153576)
((merged_rentals[merged_rentals['day_of_week']<5]['peak_afternoon_rentals'] / merged_rentals[merged_rentals['day_of_week']<5]['daily_total_rentals'])*100).mean()
np.float64(45.872687802227084)
merged_rentals[merged_rentals['day_of_week']<5]['peak_percentage'].mean()
np.float64(69.11051329938066)
((merged_rentals[merged_rentals['day_of_week']>=5]['peak_morning_rentals'] / merged_rentals[merged_rentals['day_of_week']>=5]['daily_total_rentals'])*100).mean()
np.float64(8.457490553213448)
((merged_rentals[merged_rentals['day_of_week']>=5]['peak_afternoon_rentals'] / merged_rentals[merged_rentals['day_of_week']>=5]['daily_total_rentals'])*100).mean()
np.float64(45.09105719128584)
merged_rentals[merged_rentals['day_of_week']>=5]['peak_percentage'].mean()
np.float64(53.548547744499295)
from lets_plot import *

plot_peak_percentage = (
    ggplot(peak_hour_percentages, aes(x='day_of_week_name', y='peak_percentage', fill='day_of_week_name'))
    + geom_bar(stat='identity')
    + labs(
        title='Average Percentage of Daily Rentals during Peak Hours by Day of the Week',
        x='Day of the Week',
        y='Average Peak Percentage (%)',
        fill='Day of the Week'
    )
    + theme(axis_text_x=element_text(angle=45, hjust=1))
)

display(plot_peak_percentage)
import pandas as pd
import numpy as np
from lets_plot import *

# Ensure 'total' column exists (if not already created)
if 'total' not in bikes.columns:
    bikes['total'] = bikes['casual'] + bikes['registered']

# Create a unique day identifier for grouping
bikes['day_identifier'] = bikes['year'].astype(str) + '-' + bikes['month'].astype(str).str.zfill(2) + '-' + bikes['day'].astype(str).str.zfill(2)

# Calculate total daily rentals
daily_totals = bikes.groupby('day_identifier')['total'].sum().reset_index()
daily_totals.rename(columns={'total': 'daily_total_rentals'}, inplace=True)

# Calculate hourly rentals for casual, registered, and total
hourly_rentals_detailed = bikes.groupby(['day_identifier', 'day_of_week', 'hr'])[['casual', 'registered', 'total']].sum().reset_index()

# Merge with daily totals
hourly_proportions_detailed = pd.merge(hourly_rentals_detailed, daily_totals, on='day_identifier', how='left')

# Calculate the proportion of hourly casual and registered rentals relative to the daily total
hourly_proportions_detailed['casual_proportion'] = hourly_proportions_detailed.apply(
    lambda row: row['casual'] / row['daily_total_rentals'] if row['daily_total_rentals'] > 0 else 0,
    axis=1
)
hourly_proportions_detailed['registered_proportion'] = hourly_proportions_detailed.apply(
    lambda row: row['registered'] / row['daily_total_rentals'] if row['daily_total_rentals'] > 0 else 0,
    axis=1
)

# Melt the detailed proportions for plotting
df_melted_proportions = hourly_proportions_detailed.melt(
    id_vars=['day_identifier', 'day_of_week', 'hr'],
    value_vars=['casual_proportion', 'registered_proportion'],
    var_name='rider_type',
    value_name='proportion'
)

# Calculate the average hourly proportion for each day of the week and rider type
average_hourly_proportions_detailed = df_melted_proportions.groupby(['day_of_week', 'hr', 'rider_type'])['proportion'].mean().reset_index()

# Map day_of_week to names for better readability and ensure correct order
day_names_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_names_map = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
average_hourly_proportions_detailed['day_of_week_name'] = average_hourly_proportions_detailed['day_of_week'].map(day_names_map)
average_hourly_proportions_detailed['day_of_week_name'] = pd.Categorical(average_hourly_proportions_detailed['day_of_week_name'], categories=day_names_order, ordered=True)

# Display the first few rows of the result
print("Average hourly proportions of bike rentals by day of the week and rider type:")
display(average_hourly_proportions_detailed.head())

# Create the bar grid plot
plot_hourly_proportions_detailed = (
    ggplot(average_hourly_proportions_detailed, aes(x='hr', y='proportion', color='rider_type'))
    + geom_line(size=1)
    + facet_grid(x='day_of_week_name', scales='free_x') # Create a subplot for each day of the week
    + labs(
        title='Hourly Proportion of Bike Rentals by Day of the Week and Rider Type',
        x='Hour of Day',
        y='Average Proportion of Daily Rentals',
        color='Rider Type'
    )
    + scale_x_continuous(breaks=list(range(0, 24, 4))) # Set x-axis breaks for hours
    + ggsize(1000, 300)
)

display(plot_hourly_proportions_detailed)
Average hourly proportions of bike rentals by day of the week and rider type:
day_of_week hr rider_type proportion day_of_week_name
0 0 0.0 casual_proportion 0.000286 Monday
1 0 0.0 registered_proportion 0.000541 Monday
2 0 1.0 casual_proportion 0.000141 Monday
3 0 1.0 registered_proportion 0.000257 Monday
4 0 2.0 casual_proportion 0.000079 Monday
# Calculate total daily rentals for each day
bikes['day_identifier'] = bikes['year'].astype(str) + '-' + bikes['month'].astype(str).str.zfill(2) + '-' + bikes['day'].astype(str).str.zfill(2)
daily_totals_df = bikes.groupby('day_identifier')['total'].sum().reset_index()
daily_totals_df.rename(columns={'total': 'daily_total_rentals'}, inplace=True)

# Calculate total casual and registered rentals for each day, and keep the correct day_of_week
daily_rider_totals_df = bikes.groupby('day_identifier').agg({'casual': 'sum', 'registered': 'sum', 'day_of_week': 'first'}).reset_index()

# Merge daily rider totals with daily total rentals
daily_proportions_df = pd.merge(daily_rider_totals_df, daily_totals_df, on='day_identifier')

# Calculate daily proportions
daily_proportions_df['casual_daily_proportion'] = daily_proportions_df.apply(
    lambda row: row['casual'] / row['daily_total_rentals'] if row['daily_total_rentals'] > 0 else 0,
    axis=1
)
daily_proportions_df['registered_daily_proportion'] = daily_proportions_df.apply(
    lambda row: row['registered'] / row['daily_total_rentals'] if row['daily_total_rentals'] > 0 else 0,
    axis=1
)

# Melt for easier averaging and plotting
daily_proportions_melted = daily_proportions_df.melt(
    id_vars=['day_identifier', 'day_of_week'],
    value_vars=['casual_daily_proportion', 'registered_daily_proportion'],
    var_name='rider_type',
    value_name='daily_proportion'
)

# Calculate the average daily proportion for each rider type by day of the week
average_daily_proportions_by_type = daily_proportions_melted.groupby(['day_of_week', 'rider_type'])['daily_proportion'].mean().reset_index()

# Map day_of_week to names for readability
day_names_map = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
average_daily_proportions_by_type['day_of_week_name'] = average_daily_proportions_by_type['day_of_week'].map(day_names_map)
average_daily_proportions_by_type
day_of_week rider_type daily_proportion day_of_week_name
0 0 casual_daily_proportion 0.252194 Monday
1 0 registered_daily_proportion 0.747806 Monday
2 1 casual_daily_proportion 0.250772 Tuesday
3 1 registered_daily_proportion 0.749228 Tuesday
4 2 casual_daily_proportion 0.251250 Wednesday
5 2 registered_daily_proportion 0.748750 Wednesday
6 3 casual_daily_proportion 0.244980 Thursday
7 3 registered_daily_proportion 0.755020 Thursday
8 4 casual_daily_proportion 0.251220 Friday
9 4 registered_daily_proportion 0.748780 Friday
10 5 casual_daily_proportion 0.243747 Saturday
11 5 registered_daily_proportion 0.756253 Saturday
12 6 casual_daily_proportion 0.219971 Sunday
13 6 registered_daily_proportion 0.780029 Sunday
average_hourly_proportions_detailed.groupby('rider_type')['proportion'].sum()
proportion
rider_type
casual_proportion 0.244698
registered_proportion 0.755302

Do weekends matter?

Yes. Registered riders mostly ride on weekdays and cut back on weekends. Casual riders, on the other hand, ride mostly on weekends, mainly on Saturdays, though Sundays still differ considerably from weekdays.
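The weekday/weekend contrast can be quantified with a simple weekend flag. This sketch uses toy rows with the same 0-6 `day_of_week` encoding used in the notebook; the real version would group the bikes DataFrame:

```python
import pandas as pd

# Toy hourly records; day_of_week 0-4 are weekdays, 5-6 are weekend days
toy = pd.DataFrame({
    'day_of_week': [0, 1, 2, 5, 6, 5],
    'casual':      [20, 25, 15, 90, 80, 95],
    'registered':  [300, 320, 310, 150, 140, 160],
})
toy['is_weekend'] = toy['day_of_week'] >= 5

# Mean hourly rentals per rider type, split by the weekend flag
summary = toy.groupby('is_weekend')[['casual', 'registered']].mean()
print(summary)
```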


How does the number of rentals differ on holidays?

  1. The behavior of the two rider types does not change much. Registered riders remain far more likely to rent than casual ones, but on holidays casual riders rent more often than usual.
  2. Totals are slightly higher on weekends, but almost the same. Interestingly, midweek days show more outliers than weekends.
  3. Yes. While far less frequent than weekdays, holidays' distribution and totals are very similar to those of weekdays.

from lets_plot import *
import pandas as pd


# Calculate proportions
bikes['casual_proportion'] = bikes['casual'] / bikes['total']
bikes['registered_proportion'] = bikes['registered'] / bikes['total']

# Reshape data for plotting
df_proportions = bikes[['holiday', 'casual_proportion', 'registered_proportion']].melt(
    id_vars=['holiday'], var_name='rider_type', value_name='proportion'
)

# Plot proportions
plot_proportions = ggplot(df_proportions, aes(x='holiday', y='proportion', fill='rider_type')) + \
    geom_boxplot() + \
    labs(title='Rider Proportions by Holiday Status',
         x='Holiday (0: No, 1: Yes)',
         y='Proportion of Total Rides',
         fill='Rider Type')

display(plot_proportions)
ggplot(bikes, aes(x = 'day_of_week', y = 'total')) + geom_boxplot()
from lets_plot import *
import pandas as pd

bikes['total'] = bikes['casual'] + bikes['registered']

plot_total_by_holiday = ggplot(bikes, aes(x = 'holiday', y = 'total')) + \
    geom_boxplot() + \
    labs(title='Total Bike Rentals by Holiday Status',
         x='Holiday (0: No, 1: Yes)',
         y='Total Rides')

display(plot_total_by_holiday)

Which season is the most in demand for bike rentals?

Spring and summer see the most rentals. However, fall shows more outliers. Why? Possibly COVID?

ggplot(bikes, aes('season', 'total')) + geom_boxplot()

How does the weather affect bike rentals? Is temperature important?

  1. The distributions of both temp_c and feels_like_c are right-skewed. Both show that usage peaks when the temperature (or feels-like temperature) is close to 26-27°C and decreases once it exceeds that value, so it is important to normalize these columns.

  2. Windspeed also needs normalization.

  3. Registered users do not care about humidity or temperature as much as casual users do.
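The normalization step called out above can be sketched as a plain min-max scaler. The column names follow the dataset, but the sample values here are invented:

```python
import pandas as pd

def min_max_scale(s: pd.Series) -> pd.Series:
    """Scale a numeric series to the [0, 1] range."""
    rng = s.max() - s.min()
    if rng == 0:
        return s * 0.0  # constant column: map everything to 0
    return (s - s.min()) / rng

# Hypothetical sample values for the weather columns used in the notebook
sample = pd.DataFrame({
    'temp_c': [-14.7, 7.6, 16.0, 23.5, 40.5],
    'windspeed': [0.0, 7.7, 12.2, 17.5, 69.8],
})

scaled = sample.apply(min_max_scale)
print(scaled)
```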

ggplot(bikes, aes('temp_c', 'total')) + geom_point()
ggplot(bikes, aes('temp_c', 'casual')) + geom_point()
ggplot(bikes, aes('temp_c', 'registered')) + geom_point()
ggplot(bikes, aes('feels_like_c', 'total')) + geom_point()
p_casual = (
    ggplot(bikes, aes(x='feels_like_c', y='casual'))
    + geom_point(color='red', alpha=0.4)
    + labs(title='Casual rides vs feels-like temp', x='Feels-like (°C)', y='Casual rides')
)

# Registered = blue
p_registered = (
    ggplot(bikes, aes(x='feels_like_c', y='registered'))
    + geom_point(color='blue', alpha=0.4)
    + labs(title='Registered rides vs feels-like temp', x='Feels-like (°C)', y='Registered rides')
)

display(p_casual)
display(p_registered)
ggplot(bikes, aes('hum', 'total')) + geom_point()
ggplot(bikes, aes('windspeed', 'total')) + geom_point()

What is more important: temp or feels_like?

feels_like is essentially a combination of hum, windspeed, and temp.

So use either temp + hum + windspeed OR feels_like, not both.
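A quick way to decide between the two feature sets is to check how strongly feels_like_c tracks temp_c: if the correlation is very high, keeping both mostly adds redundancy. The numbers below are synthetic stand-ins, not the real bikes data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins: feels_like is mostly temp plus small wind/humidity effects
temp = rng.uniform(-10, 35, 500)
windspeed = rng.uniform(0, 40, 500)
hum = rng.uniform(0.2, 1.0, 500)
feels_like = temp - 0.1 * windspeed + 2.0 * (hum - 0.6) + rng.normal(0, 0.5, 500)

df = pd.DataFrame({'temp_c': temp, 'feels_like_c': feels_like,
                   'windspeed': windspeed, 'hum': hum})

# If temp_c and feels_like_c are highly correlated, keeping both is redundant
corr = df['temp_c'].corr(df['feels_like_c'])
print(f"corr(temp, feels_like) = {corr:.3f}")
```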

Is there any interaction with the weather situation?

No. The casual/registered proportion stays the same regardless of weathersit; only the volume changes: 1 > 2 > 3 >> 4.

from lets_plot import *
import pandas as pd

# Ensure 'total' column exists
if 'total' not in bikes.columns:
    bikes['total'] = bikes['casual'] + bikes['registered']

# Map day_of_week to names for better readability
day_names_map = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
bikes['day_of_week_name'] = bikes['day_of_week'].map(day_names_map)

# Map weathersit to names for better readability
weathersit_names = {1: 'Clear', 2: 'Cloudy', 3: 'Light Snow/Rain', 4: 'Heavy Snow/Rain'}
bikes['weathersit_name'] = bikes['weathersit'].map(weathersit_names)

# Melt the data to have 'casual' and 'registered' as a single column for plotting
df_melted_riders = bikes.melt(
    id_vars=['day_of_week_name', 'weathersit_name'],
    value_vars=['casual', 'registered'],
    var_name='rider_type',
    value_name='rentals'
)

# Aggregate rentals by day, weathersit, and rider type for plotting
# This ensures we're plotting sums for each category
aggregated_rentals = df_melted_riders.groupby(['day_of_week_name', 'weathersit_name', 'rider_type'])['rentals'].mean().reset_index()

# Ensure day_of_week_name is ordered correctly for plotting
day_names_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
aggregated_rentals['day_of_week_name'] = pd.Categorical(aggregated_rentals['day_of_week_name'], categories=day_names_order, ordered=True)


plot_weathersit_impact = (
    ggplot(aggregated_rentals, aes(x='day_of_week_name', y='rentals', fill='rider_type'))
    + geom_bar(stat='identity', position='dodge') # 'dodge' for side-by-side bars
    + facet_grid(x='weathersit_name', scales='free_x') # Separate plots by weather situation
    + labs(
        title='Bike Rentals by Day of Week, Rider Type, and Weather Situation',
        x='Day of the Week',
        y='Mean Rentals',
        fill='Rider Type'
    )
    + theme(axis_text_x=element_text(angle=45, hjust=1))
    + ggsize(1000, 600)
)

display(plot_weathersit_impact)

Impact of COVID: when did ridership return to normal?

Between spring 2020 and fall 2021 (external sources point to November 2). The number of bike rentals increased gradually, season by season, until fall 2021, when rentals can be considered back to normal. It would be better to pin down exact dates.
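One way to pin down an exact date, sketched on invented monthly totals: compare each post-2019 month against the same month's pre-COVID total and flag the first month reaching 90% of it. The 90% threshold and the 2019 baseline year are assumptions for illustration, not figures from the data:

```python
import pandas as pd

# Hypothetical monthly totals (the real analysis would aggregate the bikes data)
monthly = pd.DataFrame({
    'year':  [2019] * 4 + [2020] * 4 + [2021] * 4,
    'month': [6, 7, 8, 9] * 3,
    'total': [900, 950, 940, 910,      # pre-COVID baseline
              400, 450, 500, 520,      # pandemic dip
              700, 800, 870, 890],     # recovery
})

baseline = monthly[monthly['year'] == 2019].set_index('month')['total']
post = monthly[monthly['year'] > 2019].copy()
post['ratio'] = post.apply(lambda r: r['total'] / baseline[r['month']], axis=1)

# First month whose total is back to >= 90% of the same month in 2019
recovered = post[post['ratio'] >= 0.9].sort_values(['year', 'month']).head(1)
print(recovered[['year', 'month', 'ratio']])
```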

from lets_plot import *
import pandas as pd

# Ensure 'total' column exists
if 'total' not in bikes.columns:
    bikes['total'] = bikes['casual'] + bikes['registered']

# Create 'year_month' column for all data
bikes['year_month'] = bikes['year'].astype(str) + '-' + bikes['month'].astype(str).str.zfill(2)

# Aggregate total rentals by 'year_month'
timeline_data_all = bikes.groupby('year_month')['total'].sum().reset_index()

# Sort by 'year_month' to ensure correct plotting order
timeline_data_all = timeline_data_all.sort_values('year_month')

# Create custom labels for x-axis: '#month last_2_digits_of_year'
timeline_data_all['display_label'] = timeline_data_all['year_month'].apply(lambda x: f"#{x.split('-')[1]} {x.split('-')[0][-2:]}")

# Plotting the timeline for all data
timeline_plot_all = (
    ggplot(timeline_data_all, aes(x='display_label', y='total'))
    + geom_line(color='blue', size=1)
    + labs(
        title='Total Bike Rentals Over Time (All Data)',
        x='Month-Year',
        y='Total Bike Rentals'
    )
    + theme(axis_text_x=element_text(angle=45, hjust=1))
)

display(timeline_plot_all)
# Plot the monthly rental timeline for each season (same pipeline as above, one loop instead of four copies)
season_styles = {1: ('Winter', 'blue'), 2: ('Spring', 'orange'), 3: ('Summer', 'green'), 4: ('Fall', 'red')}

for season_id, (season_name, line_color) in season_styles.items():
    bikes_season = bikes[bikes['season'] == season_id].copy()
    bikes_season['year_month'] = bikes_season['year'].astype(str) + '-' + bikes_season['month'].astype(str).str.zfill(2)

    timeline_data_season = bikes_season.groupby('year_month')['total'].sum().reset_index()
    timeline_data_season = timeline_data_season.sort_values('year_month')
    timeline_data_season['display_label'] = timeline_data_season['year_month'].apply(lambda x: f"#{x.split('-')[1]} {x.split('-')[0][-2:]}")

    timeline_plot_season = (
        ggplot(timeline_data_season, aes(x='display_label', y='total'))
        + geom_line(color=line_color, size=1)
        + labs(
            title=f'Total Bike Rentals Over Time ({season_name} Season)',
            x='Month-Year',
            y='Total Bike Rentals'
        )
        + theme(axis_text_x=element_text(angle=45, hjust=1))
    )

    display(timeline_plot_season)
from lets_plot import *
import pandas as pd

# Ensure 'total' column exists
if 'total' not in bikes.columns:
    bikes['total'] = bikes['casual'] + bikes['registered']

# Aggregate total rentals by year and season
yearly_seasonal_data = bikes.groupby(['year', 'season'])['total'].sum().reset_index()

# Map season numbers to names for better readability
season_names = {1: 'Winter', 2: 'Spring', 3: 'Summer', 4: 'Fall'}
yearly_seasonal_data['season_name'] = yearly_seasonal_data['season'].map(season_names)

# Plotting the grouped (dodged) bar chart of totals by year and season
plot_yearly_seasonal = (
    ggplot(yearly_seasonal_data, aes(x='year', y='total', fill='season_name'))
    + geom_bar(stat='identity', position='dodge')
    + labs(
        title='Total Bike Rentals by Year and Season',
        x='Year',
        y='Total Bike Rentals',
        fill='Season'
    )
    + theme(axis_text_x=element_text(angle=45, hjust=1))
)

display(plot_yearly_seasonal)

When is COVID considered over?

Here the COVID period is taken to be March 2020 through October 2021.
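That window can be flagged directly on a datetime column. A minimal sketch on monthly stand-in dates, mirroring the commented `covid_start`/`covid_end` logic that appears later in the notebook (the real analysis would use the hourly `bikes['datetime']` column):

```python
import pandas as pd

# Stand-in monthly dates; the notebook would use the hourly bikes['datetime'] column.
df = pd.DataFrame({'datetime': pd.date_range('2019-01-01', '2022-12-31', freq='MS')})

covid_start = pd.Timestamp('2020-03-01')
covid_end = pd.Timestamp('2021-10-01')

# Mutually exclusive period flags
df['pre_covid'] = df['datetime'] < covid_start
df['covid'] = (df['datetime'] >= covid_start) & (df['datetime'] <= covid_end)
df['post_covid'] = df['datetime'] > covid_end
```

These boolean flags can then be used to group or filter any aggregate by period.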

baseline_periods = bikes[(bikes['year'] >= 2018) & (bikes['year'] <= 2019)].copy()
baseline_monthly_avg = baseline_periods.groupby(['year', 'month'])['total'].mean().reset_index()
post_covid_periods = bikes[(bikes['year'] >= 2020) & (bikes['year'] <= 2022)].copy()
post_covid_monthly_avg = post_covid_periods.groupby(['year', 'month'])['total'].mean().reset_index()
baseline_monthly_avg_std = baseline_monthly_avg.groupby('month')['total'].agg(['mean', 'std']).reset_index()
baseline_monthly_avg_std.columns = ['month', 'baseline_avg', 'baseline_std']

comparison_df = pd.merge(post_covid_monthly_avg, baseline_monthly_avg_std, on='month', how='left')
comparison_df['lower_bound'] = comparison_df['baseline_avg'] - comparison_df['baseline_std']
comparison_df['upper_bound'] = comparison_df['baseline_avg'] + comparison_df['baseline_std']

def check_normality(row):
    if row['total'] >= row['lower_bound'] and row['total'] <= row['upper_bound']:
        return 'Within Normal Range'
    elif row['total'] < row['lower_bound']:
        return 'Below Normal'
    else:
        return 'Above Normal'

comparison_df['normality_status'] = comparison_df.apply(check_normality, axis=1)
comparison_df['year_month'] = comparison_df['year'].astype(str) + '-' + comparison_df['month'].astype(str).str.zfill(2)


print("Comparison DataFrame with normality status (first 5 rows):")
display(comparison_df.head())
Comparison DataFrame with normality status (first 5 rows):
year month total baseline_avg baseline_std lower_bound upper_bound normality_status year_month
0 2020 1 264.743280 214.630376 16.926844 197.703533 231.557220 Above Normal 2020-01
1 2020 2 255.604885 253.354167 25.514770 227.839397 278.868936 Within Normal Range 2020-02
2 2020 3 218.748318 331.634590 14.097406 317.537183 345.731996 Below Normal 2020-03
3 2020 4 104.820833 470.068750 18.743240 451.325510 488.811990 Below Normal 2020-04
4 2020 5 200.337366 478.372984 34.605464 443.767520 512.978448 Below Normal 2020-05
normality_counts = comparison_df['normality_status'].value_counts()

print("Count of months by normality status:")
display(normality_counts)
Count of months by normality status:
count
normality_status
Below Normal 23
Within Normal Range 7
Above Normal 6

comparison_df = comparison_df.sort_values('year_month')

# Melt the DataFrame to prepare for plotting multiple lines with a single color aesthetic
comparison_df_for_plot = comparison_df.melt(
    id_vars=['year_month', 'lower_bound', 'upper_bound', 'normality_status', 'year'],
    value_vars=['total', 'baseline_avg'], # 'total' is the post_covid_avg in this df
    var_name='metric_type',
    value_name='value'
)

plot_normality = (
    ggplot(comparison_df_for_plot, aes(x='year_month', y='value', color='metric_type'))
    + geom_line(size=1)
    + geom_ribbon(aes(ymin='lower_bound', ymax='upper_bound'), fill='gray', alpha=0.3, show_legend=False) # shaded baseline ±1 std band
    + labs(
        title='Post-COVID Monthly Averages vs. Baseline Normality',
        x='Month-Year',
        y='Total Bike Rentals',
        color='Metric' # Legend title
    )
    + scale_color_manual(values={'total': 'blue', 'baseline_avg': 'red'}, labels={'total': 'Post-COVID Average', 'baseline_avg': 'Baseline Average'})
    + theme(axis_text_x=element_text(angle=45, hjust=1))
)

display(plot_normality)

r2 = r2_score(y_test,predictions)
r2
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/tmp/ipython-input-391/4066346188.py in <cell line: 0>()
----> 1 r2 = r2_score(y_test,predictions)
      2 r2

NameError: name 'y_test' is not defined
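The traceback above occurs because `y_test` and `predictions` are only defined later, in the modeling section. `r2_score` simply needs a true array and a predicted array; a self-contained sketch on hypothetical data (the split, model, and data here are stand-ins, not the notebook's actual pipeline):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; in the notebook these would come from `bikes`.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Define y_test and predictions before calling r2_score
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)
r2 = r2_score(y_test, predictions)
```

Once a model has been fit and predictions produced, the original cell runs as written.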

What limits does the business have? What would it like to know?

Did the rider-type proportions change after COVID? Is registered still the leader? What changed after COVID?

Post-COVID behavior did change: more people are riding bikes recreationally. The registered proportion stayed roughly the same, but the casual proportion increased, so the growth in total riders is driven by the increase in casual riders.
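The casual-share claim above can be checked with a simple share computation. The totals below are hypothetical stand-ins, not values from `bikes`:

```python
import pandas as pd

# Hypothetical pre/post-COVID rider totals, for illustration only.
yearly = pd.DataFrame({
    'period': ['pre_covid', 'post_covid'],
    'casual': [100_000, 180_000],
    'registered': [400_000, 420_000],
})

# Share of each rider type within the period total
yearly['total'] = yearly['casual'] + yearly['registered']
yearly['casual_share'] = yearly['casual'] / yearly['total']
yearly['registered_share'] = yearly['registered'] / yearly['total']
```

The same computation on the real aggregates would confirm (or refute) whether the casual share grew while the registered share held steady.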

# After covid?
import pandas as pd
import numpy as np

# Define peak hour windows
# Morning: 6-10 AM (hours 6-9)
peak_morning_start = 6
peak_morning_end = 10 # Exclusive, so includes hours 6-9

# Afternoon: 2-8 PM (hours 14-19)
peak_afternoon_start = 14
peak_afternoon_end = 20 # Exclusive, so includes hours 14-19

# Calculate total daily rentals
daily_rentals = bikes[(bikes['year'] >= 2022)].groupby(['year', 'month', 'day', 'day_of_week'])['total'].sum().reset_index()
daily_rentals = daily_rentals.rename(columns={'total': 'daily_total_rentals'})

# Calculate rentals during peak morning hours
peak_morning_rentals = bikes[
    (bikes['year'] >= 2022) & (bikes['hr'] >= peak_morning_start) & (bikes['hr'] < peak_morning_end)
].groupby(['year', 'month', 'day', 'day_of_week'])['total'].sum().reset_index()
peak_morning_rentals = peak_morning_rentals.rename(columns={'total': 'peak_morning_rentals'})

# Calculate rentals during peak afternoon hours
peak_afternoon_rentals = bikes[
    (bikes['year'] >= 2022) & (bikes['hr'] >= peak_afternoon_start) & (bikes['hr'] < peak_afternoon_end)
].groupby(['year', 'month', 'day', 'day_of_week'])['total'].sum().reset_index()
peak_afternoon_rentals = peak_afternoon_rentals.rename(columns={'total': 'peak_afternoon_rentals'})

# Merge all dataframes
merged_rentals = pd.merge(daily_rentals, peak_morning_rentals, on=['year', 'month', 'day', 'day_of_week'], how='left')
merged_rentals = pd.merge(merged_rentals, peak_afternoon_rentals, on=['year', 'month', 'day', 'day_of_week'], how='left')

# Fill NaN values (days without rentals in a specific peak window) with 0
merged_rentals['peak_morning_rentals'] = merged_rentals['peak_morning_rentals'].fillna(0)
merged_rentals['peak_afternoon_rentals'] = merged_rentals['peak_afternoon_rentals'].fillna(0)

# Calculate total peak hour rentals
merged_rentals['total_peak_rentals'] = merged_rentals['peak_morning_rentals'] + merged_rentals['peak_afternoon_rentals']

# Calculate percentage of total daily rentals during peak hours
# Handle potential division by zero if daily_total_rentals is 0
merged_rentals['peak_percentage'] = (
    merged_rentals['total_peak_rentals'] / merged_rentals['daily_total_rentals']
) * 100
merged_rentals['peak_percentage'] = merged_rentals['peak_percentage'].replace([np.inf, -np.inf], 0).fillna(0)


# Calculate the average peak percentage by day of the week
peak_hour_percentages = merged_rentals.groupby('day_of_week')['peak_percentage'].mean().reset_index()

# Map day_of_week to names for better readability
day_names = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
peak_hour_percentages['day_of_week_name'] = peak_hour_percentages['day_of_week'].map(day_names)

print("Average percentage of daily total bike rentals during peak hours (6-10 AM and 2-8 PM) by day of the week:")
display(peak_hour_percentages)
Average percentage of daily total bike rentals during peak hours (6-10 AM and 2-8 PM) by day of the week:
day_of_week peak_percentage day_of_week_name
0 0 65.583915 Monday
1 1 68.527681 Tuesday
2 2 68.095031 Wednesday
3 3 66.734089 Thursday
4 4 63.016763 Friday
5 5 52.989135 Saturday
6 6 53.032983 Sunday
from lets_plot import *

plot_peak_percentage = (
    ggplot(peak_hour_percentages, aes(x='day_of_week_name', y='peak_percentage', fill='day_of_week_name'))
    + geom_bar(stat='identity')
    + labs(
        title='Average Percentage of Daily Rentals during Peak Hours by Day of the Week',
        x='Day of the Week',
        y='Average Peak Percentage (%)',
        fill='Day of the Week'
    )
    + theme(axis_text_x=element_text(angle=45, hjust=1))
)

display(plot_peak_percentage)
import pandas as pd
import numpy as np
from lets_plot import *

# Ensure 'total' column exists (if not already created)
if 'total' not in bikes.columns:
    bikes['total'] = bikes['casual'] + bikes['registered']

# Create a unique day identifier for grouping
bikes['day_identifier'] = bikes['year'].astype(str) + '-' + bikes['month'].astype(str).str.zfill(2) + '-' + bikes['day'].astype(str).str.zfill(2)

# Calculate total daily rentals
daily_totals = bikes[(bikes['year'] >= 2022)].groupby('day_identifier')['total'].sum().reset_index()
daily_totals.rename(columns={'total': 'daily_total_rentals'}, inplace=True)

# Calculate hourly rentals for casual, registered, and total
hourly_rentals_detailed = bikes[(bikes['year'] >= 2022)].groupby(['day_identifier', 'day_of_week', 'hr'])[['casual', 'registered', 'total']].sum().reset_index()

# Merge with daily totals
hourly_proportions_detailed = pd.merge(hourly_rentals_detailed, daily_totals, on='day_identifier', how='left')

# Calculate the proportion of hourly casual and registered rentals relative to the daily total
hourly_proportions_detailed['casual_proportion'] = hourly_proportions_detailed.apply(
    lambda row: row['casual'] / row['daily_total_rentals'] if row['daily_total_rentals'] > 0 else 0,
    axis=1
)
hourly_proportions_detailed['registered_proportion'] = hourly_proportions_detailed.apply(
    lambda row: row['registered'] / row['daily_total_rentals'] if row['daily_total_rentals'] > 0 else 0,
    axis=1
)

# Melt the detailed proportions for plotting
df_melted_proportions = hourly_proportions_detailed.melt(
    id_vars=['day_identifier', 'day_of_week', 'hr'],
    value_vars=['casual_proportion', 'registered_proportion'],
    var_name='rider_type',
    value_name='proportion'
)

# Calculate the average hourly proportion for each day of the week and rider type
average_hourly_proportions_detailed = df_melted_proportions.groupby(['day_of_week', 'hr', 'rider_type'])['proportion'].mean().reset_index()

# Map day_of_week to names for better readability and ensure correct order
day_names_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_names_map = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
average_hourly_proportions_detailed['day_of_week_name'] = average_hourly_proportions_detailed['day_of_week'].map(day_names_map)
average_hourly_proportions_detailed['day_of_week_name'] = pd.Categorical(average_hourly_proportions_detailed['day_of_week_name'], categories=day_names_order, ordered=True)

# Display the first few rows of the result
print("Average hourly proportions of bike rentals by day of the week and rider type:")
display(average_hourly_proportions_detailed.head())

# Create a faceted line plot of hourly proportions, one panel per day of the week
plot_hourly_proportions_detailed = (
    ggplot(average_hourly_proportions_detailed, aes(x='hr', y='proportion', color='rider_type'))
    + geom_line(size=1)
    + facet_grid(x='day_of_week_name', scales='free_x') # Create a subplot for each day of the week
    + labs(
        title='Hourly Proportion of Bike Rentals by Day of the Week and Rider Type',
        x='Hour of Day',
        y='Average Proportion of Daily Rentals',
        color='Rider Type'
    )
    + scale_x_continuous(breaks=list(range(0, 24, 4))) # Set x-axis breaks for hours
    + ggsize(1000, 300)
)

display(plot_hourly_proportions_detailed)
Average hourly proportions of bike rentals by day of the week and rider type:
day_of_week hr rider_type proportion day_of_week_name
0 0 0.0 casual_proportion 0.000393 Monday
1 0 0.0 registered_proportion 0.000477 Monday
2 0 1.0 casual_proportion 0.000192 Monday
3 0 1.0 registered_proportion 0.000242 Monday
4 0 2.0 casual_proportion 0.000113 Monday
ggplot(bikes.groupby(['year'])['total'].sum().reset_index(), aes(x='year', y='total')) + geom_line()
bikes_melted = bikes.melt(
    id_vars=['datetime', 'year', 'total'],
    value_vars=['casual', 'registered'],
    var_name='rider_type',
    value_name='total_rents'
)
ggplot(bikes_melted.groupby(['year','rider_type'])['total_rents'].sum().reset_index(), aes(x='year', y='total_rents', color = 'rider_type')) + geom_line()
ggplot(bikes_melted.groupby(['year','rider_type'])[['total_rents']].mean().reset_index(), aes(x='year', y='total_rents', color = 'rider_type')) + geom_line()

Other Questions

👉 Do holidays behave more like weekends or like weekdays?

Do the fall outliers occur only in 2020-2021, or in every year?

Compare the 2018-2019 monthly mean against 2020-2022: once the mean is back within one standard deviation of the baseline → "normality".

Are the outliers positive or negative? Are they tied to specific events?

🌧️ 4. Which weather variable explains the most variance?

🧮 5. What best explains demand?
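The weather-variance question can be answered mechanically: for each weather column, the squared Pearson correlation with `total` is the variance a univariate linear fit explains. A sketch on synthetic stand-in data (the real analysis would use `temp_c`, `feels_like_c`, `hum`, and `windspeed` from `bikes`):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the weather columns; total is built to depend mostly on temperature.
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    'temp_c': rng.normal(15, 8, n),
    'hum': rng.uniform(20, 100, n),
    'windspeed': rng.gamma(2, 5, n),
})
df['total'] = 20 * df['temp_c'] - 2 * df['hum'] + rng.normal(0, 50, n)

# Squared correlation = R^2 of a univariate linear fit on that single variable
r2_per_var = {col: df[col].corr(df['total']) ** 2 for col in ['temp_c', 'hum', 'windspeed']}
best = max(r2_per_var, key=r2_per_var.get)
```

Ranking `r2_per_var` on the real data would answer question 4 directly; a multivariate model (question 5) can then be compared against the best single variable.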

What is the relationship between bike rentals and the type of event?

events = [

# =============================
# BIKE TO WORK DAY
# =============================

{"name": "bike_to_work_2011", "type": "cycling",
 "start": "2011-05-20 05:00", "end": "2011-05-20 12:00"},

{"name": "bike_to_work_2012", "type": "cycling",
 "start": "2012-05-18 05:00", "end": "2012-05-18 12:00"},

{"name": "bike_to_work_2013", "type": "cycling",
 "start": "2013-05-17 05:00", "end": "2013-05-17 12:00"},

{"name": "bike_to_work_2014", "type": "cycling",
 "start": "2014-05-16 05:00", "end": "2014-05-16 12:00"},

{"name": "bike_to_work_2015", "type": "cycling",
 "start": "2015-05-15 05:00", "end": "2015-05-15 12:00"},

{"name": "bike_to_work_2016", "type": "cycling",
 "start": "2016-05-20 05:00", "end": "2016-05-20 12:00"},

{"name": "bike_to_work_2017", "type": "cycling",
 "start": "2017-05-19 05:00", "end": "2017-05-19 12:00"},

{"name": "bike_to_work_2018", "type": "cycling",
 "start": "2018-05-18 05:00", "end": "2018-05-18 12:00"},

{"name": "bike_to_work_2019", "type": "cycling",
 "start": "2019-05-17 05:00", "end": "2019-05-17 12:00"},

{"name": "bike_to_work_2021", "type": "cycling",
 "start": "2021-05-21 05:00", "end": "2021-05-21 12:00"},

{"name": "bike_to_work_2022", "type": "cycling",
 "start": "2022-05-20 05:00", "end": "2022-05-20 12:00"},

{"name": "bike_to_work_2023", "type": "cycling",
 "start": "2023-05-19 05:00", "end": "2023-05-19 12:00"},


# =============================
# ARMED FORCES CYCLING CLASSIC
# =============================

{"name": "armed_forces_2021", "type": "cycling",
 "start": "2021-06-05 00:00", "end": "2021-06-06 23:59"},

{"name": "armed_forces_2022", "type": "cycling",
 "start": "2022-06-04 00:00", "end": "2022-06-05 23:59"},

{"name": "armed_forces_2023", "type": "cycling",
 "start": "2023-06-03 00:00", "end": "2023-06-04 23:59"},


# =============================
# WABA 50 STATES RIDE
# =============================

{"name": "waba_2021", "type": "cycling",
 "start": "2021-09-11 00:00", "end": "2021-09-11 23:59"},

{"name": "waba_2022", "type": "cycling",
 "start": "2022-09-10 00:00", "end": "2022-09-10 23:59"},

{"name": "waba_2023", "type": "cycling",
 "start": "2023-09-16 00:00", "end": "2023-09-16 23:59"},


# =============================
# INDEPENDENCE DAY
# =============================

{"name": "july4_2011", "type": "holiday",
 "start": "2011-07-04 00:00", "end": "2011-07-04 23:59"},

{"name": "july4_2012", "type": "holiday",
 "start": "2012-07-04 00:00", "end": "2012-07-04 23:59"},

{"name": "july4_2013", "type": "holiday",
 "start": "2013-07-04 00:00", "end": "2013-07-04 23:59"},

{"name": "july4_2014", "type": "holiday",
 "start": "2014-07-04 00:00", "end": "2014-07-04 23:59"},

{"name": "july4_2015", "type": "holiday",
 "start": "2015-07-04 00:00", "end": "2015-07-04 23:59"},

{"name": "july4_2016", "type": "holiday",
 "start": "2016-07-04 00:00", "end": "2016-07-04 23:59"},

{"name": "july4_2017", "type": "holiday",
 "start": "2017-07-04 00:00", "end": "2017-07-04 23:59"},

{"name": "july4_2018", "type": "holiday",
 "start": "2018-07-04 00:00", "end": "2018-07-04 23:59"},

{"name": "july4_2019", "type": "holiday",
 "start": "2019-07-04 00:00", "end": "2019-07-04 23:59"},


# =============================
# NATIONAL MALL FESTIVALS
# =============================

{"name": "memorial_day_2018", "type": "holiday",
 "start": "2018-05-26 00:00", "end": "2018-05-28 23:59"},

{"name": "memorial_day_2019", "type": "holiday",
 "start": "2019-05-25 00:00", "end": "2019-05-27 23:59"},


{"name": "labor_day_2018", "type": "holiday",
 "start": "2018-09-01 00:00", "end": "2018-09-03 23:59"},

{"name": "labor_day_2019", "type": "holiday",
 "start": "2019-08-31 00:00", "end": "2019-09-02 23:59"},


# =============================
# NATIONAL MARATHON
# =============================

{"name": "dc_marathon_2018", "type": "sport",
 "start": "2018-03-18 05:00", "end": "2018-03-18 14:00"},

{"name": "dc_marathon_2019", "type": "sport",
 "start": "2019-03-17 05:00", "end": "2019-03-17 14:00"},


# =============================
# SMITHSONIAN FOLKLIFE FESTIVAL
# =============================

{"name": "smithsonian_2018", "type": "festival",
 "start": "2018-06-27 00:00", "end": "2018-07-08 23:59"},

{"name": "smithsonian_2019", "type": "festival",
 "start": "2019-06-26 00:00", "end": "2019-07-07 23:59"}

]


MODELING

Neural Network

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#%%
#Model testing first
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense, BatchNormalization
from sklearn.metrics import root_mean_squared_error, r2_score
import numpy as np

# Load the dataset
data = pd.read_csv("https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/bikes.csv")


# Display the first few rows of the dataset
data.head()
# %%
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112475 entries, 0 to 112474
Data columns (total 12 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   dteday        112475 non-null  object 
 1   hr            112475 non-null  float64
 2   casual        112475 non-null  int64  
 3   registered    112475 non-null  int64  
 4   temp_c        112475 non-null  float64
 5   feels_like_c  112475 non-null  float64
 6   hum           112475 non-null  float64
 7   windspeed     112475 non-null  float64
 8   weathersit    112475 non-null  int64  
 9   season        112475 non-null  int64  
 10  holiday       112475 non-null  int64  
 11  workingday    112475 non-null  int64  
dtypes: float64(5), int64(6), object(1)
memory usage: 10.3+ MB
events = [

    # =============================
    # PRESIDENTIAL INAUGURATIONS
    # =============================
    {"name": "inauguration_2013", "type": "political",
     "start": "2013-01-20 00:00", "end": "2013-01-21 23:59"},

    {"name": "inauguration_2017", "type": "political",
     "start": "2017-01-19 00:00", "end": "2017-01-21 23:59"},

    {"name": "inauguration_2021", "type": "political",
     "start": "2021-01-19 00:00", "end": "2021-01-21 23:59"},


    # =============================
    # MAJOR POLITICAL PROTESTS
    # =============================
    {"name": "womens_march_2017", "type": "political",
     "start": "2017-01-20 18:00", "end": "2017-01-21 23:59"},

    {"name": "march_for_our_lives_2018", "type": "political",
     "start": "2018-03-23 18:00", "end": "2018-03-24 23:59"},

    {"name": "george_floyd_protests_2020", "type": "political",
     "start": "2020-05-29 18:00", "end": "2020-06-07 23:59"},


    # =============================
    # CYCLING EVENTS (DC Bike Ride)
    # =============================
    {"name": "dc_bike_ride_2016", "type": "cycling",
     "start": "2016-05-14 00:00", "end": "2016-05-15 23:59"},

    {"name": "dc_bike_ride_2017", "type": "cycling",
     "start": "2017-05-13 00:00", "end": "2017-05-14 23:59"},

    {"name": "dc_bike_ride_2018", "type": "cycling",
     "start": "2018-05-19 00:00", "end": "2018-05-20 23:59"},

    {"name": "dc_bike_ride_2019", "type": "cycling",
     "start": "2019-05-18 00:00", "end": "2019-05-19 23:59"},

    {"name": "dc_bike_ride_2022", "type": "cycling",
     "start": "2022-09-09 00:00", "end": "2022-09-11 23:59"},

    {"name": "dc_bike_ride_2023", "type": "cycling",
     "start": "2023-09-08 00:00", "end": "2023-09-10 23:59"},


    # =============================
    # CHERRY BLOSSOM FESTIVAL (Peak weeks)
    # =============================
    {"name": "cherry_blossom_2018", "type": "cultural",
     "start": "2018-04-01 00:00", "end": "2018-04-15 23:59"},

    {"name": "cherry_blossom_2019", "type": "cultural",
     "start": "2019-04-01 00:00", "end": "2019-04-14 23:59"},

    {"name": "cherry_blossom_2022", "type": "cultural",
     "start": "2022-03-20 00:00", "end": "2022-04-17 23:59"},

    {"name": "cherry_blossom_2023", "type": "cultural",
     "start": "2023-03-18 00:00", "end": "2023-04-16 23:59"},
]

fechas_eventos = [
    # DC Bike Ride (usually May or September)
    '2016-05-22', '2017-05-14', '2018-05-19', '2019-05-18', '2021-09-25', '2022-09-10', '2023-09-09',
    # Bike to Work Day (third Friday of May)
    '2011-05-20', '2012-05-18', '2013-05-17', '2014-05-16', '2015-05-15', '2016-05-20',
    '2017-05-19', '2018-05-18', '2019-05-17', '2020-05-15', '2021-05-21', '2022-05-20', '2023-05-19',
    # WABA 50 States Ride (September)
    '2021-09-11', '2022-09-10', '2023-09-16',
    # Armed Forces Cycling Classic (June)
    '2021-06-05', '2021-06-06', '2022-06-04', '2022-06-05', '2023-06-03', '2023-06-04'
]

event_dates = pd.to_datetime(fechas_eventos).date
def rawtransformer(data):
    data['total'] = data['casual'] + data['registered']


    data["datetime"] = pd.to_datetime(data["dteday"])
    data["month"] = data["datetime"].dt.month
    data['day'] = data['datetime'].dt.day
    data["day_of_week"] = data["datetime"].dt.dayofweek
    data["year"] = data["datetime"].dt.year

    data["hour_sin"] = np.sin(2 * np.pi * data["hr"] / 24)
    data["hour_cos"] = np.cos(2 * np.pi * data["hr"] / 24)
    data["dow_sin"] = np.sin(2 * np.pi * data["day_of_week"] / 7)
    data["dow_cos"] = np.cos(2 * np.pi * data["day_of_week"] / 7)
    data["month_sin"] = np.sin(2 * np.pi * data["month"] / 12)
    data["month_cos"] = np.cos(2 * np.pi * data["month"] / 12)
    data["time_index"] = np.arange(len(data))

    data = data.drop(columns=['hr','day_of_week'])


    # Year-lagged rolling features: shift(8760) = one year of hours;
    # 8765 / 8784 / 8928 = one year plus 5 hours / 1 day / 1 week.
    data["mean_roll_year_c"] = data["casual"].shift(8760).rolling(5, center = True).mean()
    data["mean_roll_year+5_c"] = data["casual"].shift(8765).rolling(5, center = True).mean()
    data["mean_roll_year+24_c"] = data["casual"].shift(8784).rolling(5, center = True).mean()
    data["mean_roll_year+168_c"] = data["casual"].shift(8928).rolling(5, center = True).mean()

    data["mean_roll_year_r"] = data["registered"].shift(8760).rolling(5, center = True).mean()
    data["mean_roll_5_r"] = data["registered"].shift(8765).rolling(5, center = True).mean()
    data["mean_roll_24_r"] = data["registered"].shift(8784).rolling(5, center = True).mean()
    data["mean_roll_168_r"] = data["registered"].shift(8928).rolling(5, center = True).mean()

    data["median_roll_year_r"] = data["registered"].shift(8760).rolling(5, center = True).median()
    data["median_roll_year+5_r"] = data["registered"].shift(8765).rolling(5, center = True).median()
    data["median_roll_year+24_r"] = data["registered"].shift(8784).rolling(5, center = True).median()
    data["median_roll_year+168_r"] = data["registered"].shift(8928).rolling(5, center = True).median()

    data["median_roll_year_c"] = data["casual"].shift(8760).rolling(5, center = True).median()
    data["median_roll_year+5_c"] = data["casual"].shift(8765).rolling(5, center = True).median()
    data["median_roll_year+24_c"] = data["casual"].shift(8784).rolling(5, center = True).median()
    data["median_roll_year+168_c"] = data["casual"].shift(8928).rolling(5, center = True).median()

    data["std_roll_year_c"] = data["casual"].shift(8760).rolling(5, center = True).std()
    data["std_roll_year+5_c"] = data["casual"].shift(8765).rolling(5, center = True).std()
    data["std_roll_year+24_c"] = data["casual"].shift(8784).rolling(5, center = True).std()
    data["std_roll_year+168_c"] = data["casual"].shift(8928).rolling(5, center = True).std()

    data["std_roll_year_r"] = data["registered"].shift(8760).rolling(5, center = True).std()
    data["std_roll_year+5_r"] = data["registered"].shift(8765).rolling(5, center = True).std()
    data["std_roll_year+24_r"] = data["registered"].shift(8784).rolling(5, center = True).std()
    data["std_roll_year+168_r"] = data["registered"].shift(8928).rolling(5, center = True).std()
    # TODO: create rolling windows over ranges of hours.


    # covid_start = pd.Timestamp("2020-03-01 00:00")
    # covid_end = pd.Timestamp("2021-10-01 00:00")

    # data['pre_covid'] = data['datetime'] < covid_start
    # data['covid'] = (data['datetime'] >= covid_start) & (data['datetime'] <= covid_end)
    # data['post_covid'] = data['datetime'] > covid_end

    data["time_index"] = np.arange(len(data))
    def apply_events(df, events_list):

      df['event_general'] = False
      df['event_political'] = False
      df['event_cycling'] = False
      df['event_cultural'] = False

      for event in events_list:

          start = pd.Timestamp(event['start'])
          end = pd.Timestamp(event['end'])

          mask = (df['datetime'] >= start) & (df['datetime'] <= end)

          df.loc[mask, 'event_general'] = True

          if event['type'] == "political":
              df.loc[mask, 'event_political'] = True
          elif event['type'] == "cycling":
              df.loc[mask, 'event_cycling'] = True
          elif event['type'] == "cultural":
              df.loc[mask, 'event_cultural'] = True

      return df
    data = apply_events(data, events)
    data['event_cycling'] = data['datetime'].dt.date.isin(event_dates)  # note: overwrites the cycling flag set by apply_events



    data = data.drop(columns = ["datetime","dteday"])

    return data

data = rawtransformer(data)
data = data[~data['year'].isin([2020, 2021])]

data = data.dropna()

# X = data.drop(columns=["casual","registered","total"])
# y = data["total"]

# from sklearn.model_selection import TimeSeriesSplit

# tscv = TimeSeriesSplit(n_splits=3)

# for train_index, val_index in tscv.split(X):
#     X_train, X_test = X.iloc[train_index], X.iloc[val_index]
#     y_train, y_test = y.iloc[train_index], y.iloc[val_index]

    # print("Train size:", len(train_index))
    # print("Val size:", len(val_index))

X_all = data.drop(columns=["casual","registered","total"])
y_all = data["total"]

# def create_sequences(X, y, window=168):
#     Xs, ys = [], []
#     X = X.values
#     y = y.values

#     for i in range(len(X) - window):
#         Xs.append(X[i:i+window])
#         ys.append(y[i+window])

#     return np.array(Xs), np.array(ys)

# X_seq, y_seq = create_sequences(X_all, y_all, window=168)

# dates = data["year"].values[168:]
# months = data["month"].values[168:]

# train_idx = dates <= 2022
# val_idx   = (dates == 2023) & (months < 7)
# test_idx  = (dates == 2023) & (months >= 7)

X_train_seq = data[data['year'] <= 2022].drop(columns=["casual","registered","total"])
y_train_seq = data[data['year'] <= 2022]["total"]

X_val_seq = data[(data['year'] == 2023) & (data['month']<7)].drop(columns=["casual","registered","total"])
y_val_seq = data[(data['year'] == 2023) & (data['month']<7)]['total']

X_test_seq = data[(data['year'] == 2023) & (data['month']>=7)].drop(columns=["casual","registered","total"])
y_test_seq = data[(data['year'] == 2023) & (data['month']>=7)]['total']

feature_columns = X_train_seq.columns
# print(f"Train Mean: {y_train_seq.mean()}")
# print(f"Val Mean: {y_val_seq.mean()}")
# print(f"Test Mean: {y_test_seq.mean()}")
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 86001 entries, 8930 to 112472
Data columns (total 49 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   casual                  86001 non-null  int64  
 1   registered              86001 non-null  int64  
 2   temp_c                  86001 non-null  float64
 3   feels_like_c            86001 non-null  float64
 4   hum                     86001 non-null  float64
 5   windspeed               86001 non-null  float64
 6   weathersit              86001 non-null  int64  
 7   season                  86001 non-null  int64  
 8   holiday                 86001 non-null  int64  
 9   workingday              86001 non-null  int64  
 10  total                   86001 non-null  int64  
 11  month                   86001 non-null  int32  
 12  day                     86001 non-null  int32  
 13  year                    86001 non-null  int32  
 14  hour_sin                86001 non-null  float64
 15  hour_cos                86001 non-null  float64
 16  dow_sin                 86001 non-null  float64
 17  dow_cos                 86001 non-null  float64
 18  month_sin               86001 non-null  float64
 19  month_cos               86001 non-null  float64
 20  time_index              86001 non-null  int64  
 21  mean_roll_year_c        86001 non-null  float64
 22  mean_roll_year+5_c      86001 non-null  float64
 23  mean_roll_year+24_c     86001 non-null  float64
 24  mean_roll_year+168_c    86001 non-null  float64
 25  mean_roll_year_r        86001 non-null  float64
 26  mean_roll_5_r           86001 non-null  float64
 27  mean_roll_24_r          86001 non-null  float64
 28  mean_roll_168_r         86001 non-null  float64
 29  median_roll_year_r      86001 non-null  float64
 30  median_roll_year+5_r    86001 non-null  float64
 31  median_roll_year+24_r   86001 non-null  float64
 32  median_roll_year+168_r  86001 non-null  float64
 33  median_roll_year_c      86001 non-null  float64
 34  median_roll_year+5_c    86001 non-null  float64
 35  median_roll_year+24_c   86001 non-null  float64
 36  median_roll_year+168_c  86001 non-null  float64
 37  std_roll_year_c         86001 non-null  float64
 38  std_roll_year+5_c       86001 non-null  float64
 39  std_roll_year+24_c      86001 non-null  float64
 40  std_roll_year+168_c     86001 non-null  float64
 41  std_roll_year_r         86001 non-null  float64
 42  std_roll_year+5_r       86001 non-null  float64
 43  std_roll_year+24_r      86001 non-null  float64
 44  std_roll_year+168_r     86001 non-null  float64
 45  event_general           86001 non-null  bool   
 46  event_political         86001 non-null  bool   
 47  event_cycling           86001 non-null  bool   
 48  event_cultural          86001 non-null  bool   
dtypes: bool(4), float64(34), int32(3), int64(8)
memory usage: 29.5 MB
import numpy as np
import tensorflow.keras.backend as K
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler = MinMaxScaler()

X_train_seq = scaler.fit_transform(X_train_seq)
X_val_seq = scaler.transform(X_val_seq)
X_test_seq = scaler.transform(X_test_seq)
# # reshape to 2D for scaling
# nsamples, ntimesteps, nfeatures = X_train_seq.shape
# X_train_reshaped = X_train_seq.reshape(-1, nfeatures)

# scaler.fit(X_train_reshaped)

# # apply to train
# X_train_scaled = scaler.transform(X_train_reshaped)
# X_train_seq = X_train_scaled.reshape(nsamples, ntimesteps, nfeatures)

# # repeat for val and test
# def scale_3d(X, scaler):
#     nsamples, ntimesteps, nfeatures = X.shape
#     Xr = X.reshape(-1, nfeatures)
#     Xs = scaler.transform(Xr)
#     return Xs.reshape(nsamples, ntimesteps, nfeatures)

# X_val_seq = scale_3d(X_val_seq, scaler)
# X_test_seq = scale_3d(X_test_seq, scaler)
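Note that the scaler above is fit on the training split only, so validation and test data are transformed with training statistics; values outside the training range can fall outside [0, 1]. A minimal sketch of that behavior:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0], [10.0]])
X_test  = np.array([[20.0]])            # outside the training range

scaler = MinMaxScaler().fit(X_train)    # statistics come from the training split only
X_test_scaled = scaler.transform(X_test)
assert X_test_scaled[0, 0] == 2.0       # test values may exceed [0, 1] — that is expected
```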
# Custom RMSE metric (not used as the loss; kept for earlier plots/reporting)
def rmse_metric(y_true, y_pred):
    return K.sqrt(K.mean(K.square(K.cast(y_true, dtype='float32') - y_pred)))
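The same formula can be checked in plain NumPy outside a TensorFlow session; the `rmse_np` helper below is hypothetical, for illustration only:

```python
import numpy as np

def rmse_np(y_true, y_pred):
    # same arithmetic as the Keras-backend version above, in plain numpy
    diff = np.asarray(y_true, dtype="float32") - np.asarray(y_pred, dtype="float32")
    return float(np.sqrt(np.mean(np.square(diff))))

assert abs(rmse_np([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]) - (4.0 / 3.0) ** 0.5) < 1e-6
```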

model = Sequential([
    keras.layers.Dense(256, activation='relu'),
    Dropout(0.2),
    keras.layers.Dense(128, activation='relu'),
    Dropout(0.2),
    keras.layers.Dense(64, activation='relu'),
    Dropout(0.1),
    keras.layers.Dense(1, activation='relu')  # ReLU output keeps predicted counts non-negative
])

#%%
model.compile(
    loss=tf.keras.losses.Huber(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),       # explicit LR so scheduler can adjust it
    metrics=['mse']
)

early_stop = keras.callbacks.EarlyStopping(
    monitor='val_mse',
    mode='min', # MSE should be minimized
    patience=10,
    restore_best_weights=True
    )

lr_scheduler = keras.callbacks.ReduceLROnPlateau(
    monitor='val_mse',      # watches val MSE
    factor=0.5,             # halves the LR when it plateaus
    patience=5,             # waits 5 epochs of no improvement
    min_lr=1e-6,            # won't go below this
    verbose=1               # prints when LR changes
)
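The schedule above halves the learning rate on each plateau until it hits `min_lr`. A toy simulation of that policy (just the arithmetic, not Keras internals):

```python
# Minimal simulation of the ReduceLROnPlateau policy configured above.
lr, factor, min_lr = 1e-3, 0.5, 1e-6
for _ in range(3):                  # three consecutive plateaus
    lr = max(lr * factor, min_lr)   # halve, but never below the floor
assert abs(lr - 1.25e-4) < 1e-12
```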

history = model.fit(X_train_seq, y_train_seq,
                    epochs=100,
                    validation_data=(X_val_seq, y_val_seq),
                    batch_size=120,
                    callbacks=[early_stop, lr_scheduler],
                    shuffle=False)

hist = pd.DataFrame(history.history)

hist = hist.reset_index()

def plot_history():
    plt.figure()
    plt.xlabel('Epoch')
    plt.ylabel('Mean Squared Error')
    plt.plot(hist['index'], hist['mse'], label='Train MSE')
    plt.plot(hist['index'], hist['val_mse'], label='Val MSE')
    plt.legend()
    # plt.ylim([0,50])

plot_history()
Epoch 1/100

656/656 ━━━━━━━━━━━━━━━━━━━━ 4s 4ms/step - loss: 171.6943 - mse: 62137.5039 - val_loss: 158.4529 - val_mse: 61805.1602 - learning_rate: 0.0010

Epoch 10: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.
Epoch 38: ReduceLROnPlateau reducing learning rate to 0.0002500000118743628.

Epoch 46/100

656/656 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - loss: 53.4086 - mse: 7702.5283 - val_loss: 60.4161 - val_mse: 9239.6797 - learning_rate: 2.5000e-04

Epoch 51: ReduceLROnPlateau reducing learning rate to 0.0001250000059371814.
Epoch 56: ReduceLROnPlateau reducing learning rate to 6.25000029685907e-05.

Epoch 56/100

656/656 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - loss: 52.5019 - mse: 7527.7148 - val_loss: 64.3979 - val_mse: 10591.4756 - learning_rate: 1.2500e-04

(Intermediate epochs truncated; the best validation MSE, 9239.68, occurred at epoch 46.)

import os

# Create the directory if it doesn't exist
output_dir = "saved_model"
os.makedirs(output_dir, exist_ok=True)

model.save(os.path.join(output_dir, "my_model.keras"))
model = tf.keras.models.load_model("saved_model/my_model.keras")
# Custom RMSE metric, redefined for the second (baseline) model — still not used as the loss
def rmse_metric(y_true, y_pred):
    return K.sqrt(K.mean(K.square(K.cast(y_true, dtype='float32') - y_pred)))

model = Sequential([
    keras.layers.Dense(256, activation='relu'),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1)
])

#%%
model.compile(
    loss=tf.keras.losses.Huber(),
    optimizer='adam',
    metrics=['mse']
)
early_stop = keras.callbacks.EarlyStopping(
    monitor='val_mse',
    mode='min', # MSE should be minimized
    patience=30,
    restore_best_weights=True
    )

history = model.fit(X_train_seq, y_train_seq,
                    epochs=50,
                    validation_data=(X_val_seq, y_val_seq),
                    batch_size=120,
                    callbacks=[early_stop],
                    shuffle=False)

hist = pd.DataFrame(history.history)

hist = hist.reset_index()

def plot_history():
    plt.figure()
    plt.xlabel('Epoch')
    plt.ylabel('Mean Squared Error')
    plt.plot(hist['index'], hist['mse'], label='Train MSE')
    plt.plot(hist['index'], hist['val_mse'], label='Val MSE')
    plt.legend()
    # plt.ylim([0,50])

plot_history()
Epoch 1/50

656/656 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - loss: 177.0041 - mse: 65467.7617 - val_loss: 155.2514 - val_mse: 58704.8477

Epoch 17/50

656/656 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 50.5732 - mse: 7042.6802 - val_loss: 66.9372 - val_mse: 10720.1250

Epoch 47/50

656/656 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 39.3375 - mse: 4713.0166 - val_loss: 79.1796 - val_mse: 16153.1641

(Intermediate epochs truncated; the best validation MSE, 10720.13, occurred at epoch 17, after which validation error drifted upward.)

predictions = model.predict(X_test_seq)
93/93 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step

result = root_mean_squared_error(y_test_seq, predictions)
result
135.4075469970703
r2 = r2_score(y_test_seq,predictions)
r2
0.9207719564437866
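Both metrics can be reproduced with scikit-learn directly; a small worked example with hand-checkable numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 310.0])

# RMSE: each residual is 10, so sqrt(mean(100)) = 10
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
assert abs(rmse - 10.0) < 1e-9

# R^2: 1 - SS_res/SS_tot = 1 - 300/20000 = 0.985
assert abs(r2_score(y_true, y_pred) - 0.985) < 1e-9
```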
pred = pd.DataFrame(predictions, columns=['predictions'])
pred['actual'] = y_test_seq.tolist()
pred['difference'] = pred['actual'] - pred['predictions']
pred

import seaborn as sns

# y = x reference line spanning the full range of observed counts
# (a fixed (0, 55) limit would only cover a sliver of the hourly totals)
lims = (0, max(pred['actual'].max(), pred['predictions'].max()))
ax = sns.scatterplot(data=pred, x='actual', y='predictions')
ax.plot(lims, lims, color='r')
plt.show()

from lets_plot import *
import numpy as np
import pandas as pd

LetsPlot.setup_html()

df = bikes.copy()

if "total" not in df.columns:
    df["total"] = df["casual"] + df["registered"]

df = df.dropna(subset=["temp_c", "casual", "registered", "total"])

bin_w = 2
df["temp_bin"] = (np.floor(df["temp_c"] / bin_w) * bin_w).astype(int)
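The floor-divide-and-rescale trick maps every temperature to the left edge of its 2 °C bin (negatives round toward the colder edge). A quick check of that arithmetic:

```python
import numpy as np

bin_w = 2
temps = np.array([-1.3, 0.0, 3.9, 4.0])
bins = (np.floor(temps / bin_w) * bin_w).astype(int)
assert bins.tolist() == [-2, 0, 2, 4]   # left edge of each 2°C bin
```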

sum_bin = (
    df.groupby("temp_bin")
      .agg(
          n=("total", "size"),
          mean_casual=("casual", "mean"),
          mean_registered=("registered", "mean"),
          mean_total=("total", "mean"),
      )
      .reset_index()
      .sort_values("temp_bin")
)

plot_df = sum_bin.melt(
    id_vars=["temp_bin", "n"],
    value_vars=["mean_casual", "mean_registered", "mean_total"],
    var_name="series",
    value_name="mean_rides"
)
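`melt` turns the three mean columns into a long (series, value) layout that the plot can color by. A minimal example of the reshape, using a hypothetical two-bin frame:

```python
import pandas as pd

wide = pd.DataFrame({
    "temp_bin": [0, 2], "n": [5, 7],
    "mean_casual": [1.0, 2.0], "mean_total": [3.0, 4.0],
})
long = wide.melt(id_vars=["temp_bin", "n"],
                 value_vars=["mean_casual", "mean_total"],
                 var_name="series", value_name="mean_rides")
assert len(long) == 4                                   # 2 bins x 2 series
assert set(long["series"]) == {"mean_casual", "mean_total"}
```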

p_counts = (
    ggplot(plot_df, aes(x="temp_bin", y="mean_rides", color="series"))
    + geom_line(size=1.2)
    + geom_point(size=2.5)
    + labs(
        title=f"Average bike rentals vs temperature (binned to {bin_w}°C)",
        x="Temperature bin (°C, left edge)",
        y="Mean rides per record",
        color="Series"
    )
)

# Bar chart of records per bin (geom_bar with stat='identity' in place of geom_col)
p_n = (
    ggplot(sum_bin, aes(x="temp_bin", y="n"))
    + geom_bar(stat="identity")
    + labs(
        title="Records per temperature bin",
        x="Temperature bin (°C, left edge)",
        y="Number of records"
    )
)

display(p_counts)
display(p_n)
original_data = pd.read_csv("https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/bikes.csv")
holdout_mini = pd.read_csv("https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/biking_holdout_test_mini.csv")
original_data["is_holdout"] = 0
holdout_mini["is_holdout"] = 1
holdout_mini.shape
(384, 11)
combined = pd.concat([original_data, holdout_mini], ignore_index=True)
combined["datetime"] = pd.to_datetime(combined["dteday"])
combined_transformed = rawtransformer(combined)

holdout_mini = combined_transformed[
    combined_transformed["is_holdout"] == 1
]
holdout_mini = holdout_mini.drop(columns=["casual","registered","total"])
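The pattern here — concatenate with an `is_holdout` flag, run the shared feature engineering once, then split back out — keeps train and holdout features consistent. A stripped-down sketch (the `rawtransformer` step is elided):

```python
import pandas as pd

train_df = pd.DataFrame({"x": [1, 2]}).assign(is_holdout=0)
hold_df  = pd.DataFrame({"x": [3]}).assign(is_holdout=1)

combined = pd.concat([train_df, hold_df], ignore_index=True)
# ...shared feature engineering (e.g. rolling stats) would run on `combined` here...
recovered = combined[combined["is_holdout"] == 1]
assert recovered["x"].tolist() == [3]   # holdout rows come back out intact
```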
combined.tail(50)
dteday hr casual registered temp_c feels_like_c hum windspeed weathersit season ... day day_of_week year hour_sin hour_cos dow_sin dow_cos month_sin month_cos time_index
112809 11/28/2023 22.0 NaN NaN -1.3 -5.2 0.4091 11.5 1 4 ... 28 1 2023 -5.000000e-01 8.660254e-01 0.781831 0.623490 -0.5 0.866025 112809
112810 11/28/2023 23.0 NaN NaN -1.7 -5.3 0.4410 9.7 1 4 ... 28 1 2023 -2.588190e-01 9.659258e-01 0.781831 0.623490 -0.5 0.866025 112810
... (46 intermediate rows truncated) ...
112857 11/30/2023 22.0 NaN NaN 6.5 4.3 0.5960 10.9 1 4 ... 30 3 2023 -5.000000e-01 8.660254e-01 0.433884 -0.900969 -0.5 0.866025 112857
112858 11/30/2023 23.0 NaN NaN 6.8 5.0 0.6499 9.0 1 4 ... 30 3 2023 -2.588190e-01 9.659258e-01 0.433884 -0.900969 -0.5 0.866025 112858

50 rows × 26 columns

holdout_mini.info()
<class 'pandas.core.frame.DataFrame'>
Index: 384 entries, 112475 to 112858
Data columns (total 47 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   temp_c                  384 non-null    float64
 1   feels_like_c            384 non-null    float64
 2   hum                     384 non-null    float64
 3   windspeed               384 non-null    float64
 4   weathersit              384 non-null    int64  
 5   season                  384 non-null    int64  
 6   holiday                 384 non-null    int64  
 7   workingday              384 non-null    int64  
 8   is_holdout              384 non-null    int64  
 9   month                   384 non-null    int32  
 10  day                     384 non-null    int32  
 11  year                    384 non-null    int32  
 12  hour_sin                384 non-null    float64
 13  hour_cos                384 non-null    float64
 14  dow_sin                 384 non-null    float64
 15  dow_cos                 384 non-null    float64
 16  month_sin               384 non-null    float64
 17  month_cos               384 non-null    float64
 18  time_index              384 non-null    int64  
 19  mean_roll_year_c        382 non-null    float64
 20  mean_roll_year+5_c      382 non-null    float64
 21  mean_roll_year+24_c     382 non-null    float64
 22  mean_roll_year+168_c    382 non-null    float64
 23  mean_roll_year_r        382 non-null    float64
 24  mean_roll_5_r           382 non-null    float64
 25  mean_roll_24_r          382 non-null    float64
 26  mean_roll_168_r         382 non-null    float64
 27  median_roll_year_r      382 non-null    float64
 28  median_roll_year+5_r    382 non-null    float64
 29  median_roll_year+24_r   382 non-null    float64
 30  median_roll_year+168_r  382 non-null    float64
 31  median_roll_year_c      382 non-null    float64
 32  median_roll_year+5_c    382 non-null    float64
 33  median_roll_year+24_c   382 non-null    float64
 34  median_roll_year+168_c  382 non-null    float64
 35  std_roll_year_c         382 non-null    float64
 36  std_roll_year+5_c       382 non-null    float64
 37  std_roll_year+24_c      382 non-null    float64
 38  std_roll_year+168_c     382 non-null    float64
 39  std_roll_year_r         382 non-null    float64
 40  std_roll_year+5_r       382 non-null    float64
 41  std_roll_year+24_r      382 non-null    float64
 42  std_roll_year+168_r     382 non-null    float64
 43  event_general           384 non-null    bool   
 44  event_political         384 non-null    bool   
 45  event_cycling           384 non-null    bool   
 46  event_cultural          384 non-null    bool   
dtypes: bool(4), float64(34), int32(3), int64(6)
memory usage: 129.0 KB
holdout_mini = holdout_mini[feature_columns]
holdout_mini.shape
(384, 46)
import pandas as pd

# Rebuild the frame with only the model's feature columns, then forward-fill
# the remaining NaNs (e.g. in the rolling-window features)
holdout_mini = pd.DataFrame(holdout_mini, columns=feature_columns)
holdout_mini = holdout_mini.ffill()
# holdout_mini = pd.read_csv("https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/biking_holdout_test_mini.csv")

# holdout_mini = rawtransformer(holdout_mini)

# Scale with the scaler fitted on the training data
holdout_mini = scaler.transform(holdout_mini)

# Predict, flatten, and round to whole rental counts
holdout_mini_pred = np.round(model.predict(holdout_mini).flatten(), 0).astype(int)

holdout_mini_pred = pd.DataFrame(holdout_mini_pred, columns=["predictions"])

holdout_mini_pred.to_csv("team4-module4-predictions.csv", index=False)
12/12 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step 
holdout_mini_pred.shape
(384, 1)
print(np.max(holdout_mini_pred))
print(np.min(holdout_mini_pred))
print(np.mean(holdout_mini_pred))
print("Max real:", np.max(y_test_seq))
print("Min real:", np.min(y_test_seq))
print("Mean real:", np.mean(y_test_seq))
print(type(y_test_seq))
print(type(predictions))
print(y_test_seq.shape)
print(holdout_mini_pred.shape)
print(np.var(y_test_seq))

FINDINGS

  • Weather situation is very important. Registered users are less sensitive to it than casual users: the temperature has to drop much lower before registered ridership falls the way casual ridership does. Casual ridership decreases sharply below about 10°C, while registered ridership only decreases below about -8°C.

  • There is a seasonal pattern: ridership is lowest in winter, rises through spring and summer into fall, then declines toward the end of fall.

  • Ridership is related to major events happening in the city (political, cycling, cultural).

  • Sponsor cycling events! Track them and run marketing campaigns around them.

  • On weekends and holidays, place bikes in strategic locations (near tourist or recreational areas).

Task

The user has approved. I will now apply event flags to the bikes DataFrame.

This will involve:

1. Defining a function apply_event_flags that iterates through the provided events list.
2. Converting each event's 'start' and 'end' times to pandas datetime objects.
3. Creating boolean masks to identify rows in the bikes DataFrame where 'datetime' falls within the event's start and end times.
4. Setting the event_general flag to True for all matching rows.
5. Setting the specific event-type flags (event_political, event_cycling, event_cultural) to True for matching rows, based on the event's 'type'.
6. Applying the function to the bikes DataFrame to create the new event flag columns.
7. Displaying the first few rows of the DataFrame with the new flags to verify the changes.

Apply Event Flags

Subtask:

Define a function to apply event flags (political, cycling, cultural, and general event) to the bikes DataFrame based on the provided events list and then apply this function to the DataFrame.

Reasoning: First, I’ll convert the ‘dteday’ column to datetime objects and create a ‘total’ column by summing ‘casual’ and ‘registered’ if they don’t already exist. These are crucial preprocessing steps that will be used in the apply_event_flags function. After these initial data preparations, I will define the apply_event_flags function to add event flags (general, political, cycling, cultural) to the DataFrame based on the provided events list.

bikes['total'] = bikes['casual'] + bikes['registered']
bikes['datetime'] = pd.to_datetime(bikes['dteday'])

def apply_event_flags(df, events_list):
    df['event_general'] = False
    df['event_political'] = False
    df['event_cycling'] = False
    df['event_cultural'] = False

    for event in events_list:
        start = pd.Timestamp(event['start'])
        end = pd.Timestamp(event['end'])

        mask = (df['datetime'] >= start) & (df['datetime'] <= end)

        df.loc[mask, 'event_general'] = True

        if event['type'] == 'political':
            df.loc[mask, 'event_political'] = True
        elif event['type'] == 'cycling':
            df.loc[mask, 'event_cycling'] = True
        elif event['type'] == 'cultural':
            df.loc[mask, 'event_cultural'] = True

    return df

bikes = apply_event_flags(bikes, events)
bikes.head()

Reasoning: The ‘bikes’ DataFrame was not defined in the current execution context, causing a NameError. I will first load the ‘bikes’ DataFrame from the provided URL, then proceed with the preprocessing and function application steps as previously intended.

import pandas as pd
import numpy as np

bikes = pd.read_csv("https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/bikes.csv")

bikes['total'] = bikes['casual'] + bikes['registered']
bikes['datetime'] = pd.to_datetime(bikes['dteday'])

def apply_event_flags(df, events_list):
    df['event_general'] = False
    df['event_political'] = False
    df['event_cycling'] = False
    df['event_cultural'] = False

    for event in events_list:
        start = pd.Timestamp(event['start'])
        end = pd.Timestamp(event['end'])

        mask = (df['datetime'] >= start) & (df['datetime'] <= end)

        df.loc[mask, 'event_general'] = True

        if event['type'] == 'political':
            df.loc[mask, 'event_political'] = True
        elif event['type'] == 'cycling':
            df.loc[mask, 'event_cycling'] = True
        elif event['type'] == 'cultural':
            df.loc[mask, 'event_cultural'] = True

    return df

bikes = apply_event_flags(bikes, events)
bikes.head()
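The events list referenced above is defined earlier in the notebook. As a minimal sketch of the structure apply_event_flags assumes (the dates here are hypothetical, not the real events), the masking logic works like this:

```python
import pandas as pd

# Hypothetical entries illustrating the shape apply_event_flags expects:
# each dict needs 'start', 'end', and 'type' keys.
events = [
    {'start': '2012-07-01', 'end': '2012-07-03', 'type': 'cycling'},
    {'start': '2012-11-06', 'end': '2012-11-06', 'type': 'political'},
]

# A two-row frame to show the masking logic on its own
df = pd.DataFrame({'datetime': pd.to_datetime(['2012-06-30', '2012-07-02'])})
df['event_cycling'] = False
for event in events:
    # between() is inclusive on both ends, matching the >= / <= mask above
    mask = df['datetime'].between(pd.Timestamp(event['start']), pd.Timestamp(event['end']))
    if event['type'] == 'cycling':
        df.loc[mask, 'event_cycling'] = True

print(df['event_cycling'].tolist())  # → [False, True]
```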

Calculate Average Rentals for Events

Subtask:

Calculate the average total bike rentals for each event type and for non-event days.

Reasoning: To calculate the average total bike rentals for each event type and for non-event days, I need to first calculate the mean of the ‘total’ column for each event type flag and for when ‘event_general’ is false. Then I will combine these into a single DataFrame.

event_avg_rentals = pd.DataFrame({
    'event_type': [
        'political',
        'cycling',
        'cultural',
        'general',
        'non_event_general'
    ],
    'average_rentals': [
        bikes[bikes['event_political'] == True]['total'].mean(),
        bikes[bikes['event_cycling'] == True]['total'].mean(),
        bikes[bikes['event_cultural'] == True]['total'].mean(),
        bikes[bikes['event_general'] == True]['total'].mean(),
        bikes[bikes['event_general'] == False]['total'].mean()
    ]
})

display(event_avg_rentals)
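Equivalently, the same table can be built by looping over the flag columns instead of spelling out each filter; a sketch on toy values (illustrative only, not the real data):

```python
import pandas as pd

# Toy stand-in for the bikes frame; column names match the notebook, values are made up
bikes = pd.DataFrame({
    'total':           [100, 500, 450, 300, 200],
    'event_general':   [False, True, True, True, False],
    'event_political': [False, False, False, True, False],
    'event_cycling':   [False, True, False, False, False],
    'event_cultural':  [False, False, True, False, False],
})

# Mean of 'total' over the rows where each flag is True
flags = ['event_political', 'event_cycling', 'event_cultural', 'event_general']
avg = {f.removeprefix('event_'): bikes.loc[bikes[f], 'total'].mean() for f in flags}
avg['non_event_general'] = bikes.loc[~bikes['event_general'], 'total'].mean()

event_avg_rentals = pd.Series(avg, name='average_rentals')
print(event_avg_rentals)
```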

Visualize Event Impact

Subtask:

Create a bar plot using lets_plot to visualize the average bike rentals for different event types and non-event days to show the relationship.

Summary:

Q&A

The analysis reveals how different event types and non-event days impact average bike rentals.

Data Analysis Key Findings

  • The bikes DataFrame was successfully augmented with event flags (event_general, event_political, event_cycling, event_cultural) indicating whether a specific day falls within an event period and what type of event it is.
  • Cycling events recorded the highest average total bike rentals at approximately 488.84.
  • Cultural events also showed a strong positive association with bike rentals, averaging around 464.73.
  • General events (any event type) had an average of approximately 440.40 rentals, which is higher than non-event days.
  • Non-event days averaged significantly fewer rentals, at approximately 336.98.
  • Political events were associated with the lowest average rentals among all categories, at approximately 283.76, even lower than non-event days.

Insights or Next Steps

  • Events, particularly cycling and cultural, appear to be strong drivers for increased bike rentals, suggesting potential for promotional activities or increased bike availability during these times.
  • Further investigation could explore the specific characteristics or timing of political events that might lead to a decrease in average bike rentals, as they are the only event type associated with lower average rentals than non-event days.